import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
scikit-learn intro¶
scikit-learn, or sklearn, is a Python package designed to give access to well-known machine learning algorithms within Python code, through a clean, well-thought-out API. It has been built by hundreds of contributors from around the world, and is used across industry and academia.
scikit-learn is built upon Python's NumPy (Numerical Python) and SciPy (Scientific Python) libraries, which enable efficient in-core numerical and scientific computation within Python. As such, scikit-learn is not specifically designed for extremely large datasets, though there is some work in this area.
Data Representation¶
Most machine learning algorithms implemented in scikit-learn expect a two-dimensional array or matrix X, usually represented as a NumPy ndarray. The expected shape of X is (n_samples, n_features).
- n_samples: the number of samples, where each sample is an item to process (e.g., classify). A sample can be a document, a picture, a sound, a video, a row in a database or CSV file, or anything you can describe with a fixed set of quantitative traits.
- n_features: the number of features, i.e. distinct traits that can be used to describe each item in a quantitative manner.
The supervised machine learning algorithms implemented in scikit-learn also expect a one-dimensional array y with shape (n_samples,). This array associates a target class to every sample in the input X.
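As a minimal sketch of these conventions (toy data, not from the lecture):

```python
import numpy as np

# 4 samples, each described by 3 quantitative features
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4],
              [5.9, 3.0, 5.1]])

# one target class per sample
y = np.array([0, 0, 1, 1])

print(X.shape)  # (4, 3) -> (n_samples, n_features)
print(y.shape)  # (4,)   -> (n_samples,)
```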

Supervised Learning¶

In the supervised learning paradigm the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
After training, the fitted model no longer expects the label y as an input: it tries to predict the most likely labels y_new for a new set of samples X_new.
Depending on the nature of the target y, supervised learning can be given different names:
- If y has values in a fixed set of categorical outcomes (represented by integers), the task of predicting y is called classification.
- If y has floating point values (e.g. representing a price, a temperature, a size...), the task of predicting y is called regression.
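The two settings can be sketched side by side on toy data (LogisticRegression and LinearRegression are arbitrary example choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# classification: y holds categorical codes
y_class = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_class)

# regression: y holds floating point values
y_reg = np.array([1.1, 2.0, 2.9, 4.2])
reg = LinearRegression().fit(X, y_reg)

print(clf.predict([[3.5]]))  # a class label
print(reg.predict([[3.5]]))  # a real value
```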
Unsupervised Learning¶
Unsupervised learning addresses a different sort of problem. Here the class labels of the training data are unknown, and we are interested in finding similarities between the observations.
An unsupervised learning algorithm only uses a single set of observations X with shape (n_samples, n_features) and does not use any kind of labels.
Unsupervised learning comprises tasks such as dimensionality reduction and clustering.
The scikit-learn estimator interface¶
Scikit-learn strives to have a uniform interface across all methods. The main objects in scikit-learn are the following:
- estimator (base object)
- predictor
- transformer
Note that one class can implement multiple interfaces.
Estimator¶
Every algorithm is exposed in scikit-learn via an estimator object.
The behaviour of an estimator typically depends on a number of parameters. The parameters of an estimator can be set when it is instantiated or by modifying the corresponding attribute:
estimator = Estimator(param1=1, param2=2)
An estimator implements a fit method to learn from data, either:
- estimator = estimator.fit(data, targets), for supervised learning applications.
- estimator = estimator.fit(data), for unsupervised learning applications.
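A minimal sketch of both signatures, using LinearRegression and KMeans as arbitrary example estimators on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0.1, 1.1, 1.9, 10.2, 10.9, 12.1])

# supervised: fit(data, targets)
lr = LinearRegression().fit(X, y)

# unsupervised: fit(data) only
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# by convention, attributes learned during fit get a trailing underscore
print(lr.coef_)
print(km.cluster_centers_)
```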
Predictor¶
For supervised learning, or some unsupervised problems, the predictor object implements the following method:
prediction = predictor.predict(data): given a "fitted" model, predict the label of a new set of data. This method accepts one argument, the array of observations, and returns the predicted label for each observation in the array.
Classification algorithms usually also offer a way to quantify certainty of a prediction, either using decision_function or predict_proba:
probability = predictor.predict_proba(data): returns, for each observation, the estimated probability of each class; the label with the highest probability is the one returned by predictor.predict.
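This relationship can be verified directly (a small sketch with LogisticRegression on toy data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)  # shape (n_samples, n_classes), rows sum to 1
pred = clf.predict(X)

# predict returns the class whose column has the highest probability
assert np.array_equal(clf.classes_[np.argmax(proba, axis=1)], pred)
```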
Transformer¶
The transformer object is used for filtering or modifying the data, in a supervised or unsupervised way. It implements the following method:
new_data = transformer.transform(data)
When fitting and transforming can be performed much more efficiently together than separately, the transformer also implements:
new_data = transformer.fit_transform(data)
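As a sketch with StandardScaler (an arbitrary transformer choice), the two code paths give the same result:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

a = StandardScaler().fit(X).transform(X)  # fit, then transform
b = StandardScaler().fit_transform(X)     # same result in a single pass

print(np.allclose(a, b))  # True
```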
Check the docs

The sklearn.datasets module¶
sklearn.datasets docs
In our previous lectures, we imported the Iris dataset from a ".csv" file:
import os
iris_df = pd.read_csv(os.path.join('dataset', 'iris.csv'))
Actually, scikit-learn comes with a few standard datasets. The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference datasets. It also features some artificial data generators.
Load Iris dataset¶
The library embeds a copy of the Iris CSV file along with a helper function to load it into NumPy arrays.
from sklearn.datasets import load_iris
iris = load_iris()
type(iris)
sklearn.utils._bunch.Bunch
dir(iris)
['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'frame', 'target', 'target_names']
print(iris.DESCR)
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. dropdown:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
The features of each sample flower are stored in the data attribute of the dataset:
n_samples, n_features = iris.data.shape
print(n_samples)
print(n_features)
print(iris.data[0])
150
4
[5.1 3.5 1.4 0.2]
The information about the class of each sample is stored in the target attribute of the dataset:
iris.data.shape, iris.target.shape
((150, 4), (150,))
print(iris.target)
iris.target_names
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
Compute the class distribution from the NumPy arrays:
unique, counts = np.unique(iris.target, return_counts = True)
unique, counts
(array([0, 1, 2]), array([50, 50, 50]))
Equivalently, converting to a Pandas Series:
pd.Series(iris.target).value_counts()
0    50
1    50
2    50
Name: count, dtype: int64
Generate artificial data¶
from sklearn.datasets import make_blobs
X1, Y1 = make_blobs(n_samples = [1000, 100, 500],
n_features = 2,
random_state = 1)
plt.figure(figsize = (8, 8))
plt.title("Synthetic normally distributed dataset")
plt.scatter(X1[:, 0],
X1[:, 1],
marker = "o",
c = Y1,
s = 25,
edgecolor = "k")
plt.show()
Set random_state to get reproducible results:
f,axes = plt.subplots(1,3,figsize=(15, 5))
X1, Y1 = make_blobs(n_samples = [1000, 100, 500], n_features = 2, random_state = 42)
X2, Y2 = make_blobs(n_samples = [1000, 100, 500], n_features = 2, random_state = 42)
X3, Y3 = make_blobs(n_samples = [1000, 100, 500], n_features = 2, random_state = 21)
axes[0].scatter(X1[:, 0], X1[:, 1], marker = "o", c = Y1, s = 25, edgecolor = "k")
axes[0].set_title('random_state = 42')
axes[1].scatter(X2[:, 0], X2[:, 1], marker = "o", c = Y2, s = 25, edgecolor = "k")
axes[1].set_title('random_state = 42')
axes[2].scatter(X3[:, 0], X3[:, 1], marker = "o", c = Y3, s = 25, edgecolor = "k")
axes[2].set_title('random_state = 21')
f.suptitle('The effect of random seed')
plt.show()
The seed is used for initializing the pseudorandom number generator:
- the two processes with seed=42 lead to exactly the same result
- the process with seed=21 leads to a different result
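This can be checked directly with the same make_blobs utility (a small sketch):

```python
import numpy as np
from sklearn.datasets import make_blobs

Xa, ya = make_blobs(n_samples=50, n_features=2, random_state=42)
Xb, yb = make_blobs(n_samples=50, n_features=2, random_state=42)
Xc, yc = make_blobs(n_samples=50, n_features=2, random_state=21)

print(np.array_equal(Xa, Xb))  # True: same seed, identical data
print(np.array_equal(Xa, Xc))  # False: different seed
```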
Data preprocessing in scikit-learn¶
Legend: ✔️ covered in previous notebooks, ✅ covered in this notebook.

Major tasks in data preprocessing:
- Data cleaning
  - Handling missing values (on the HYPOTHYROID dataset) ✔️ (e.g., sklearn.impute, pd.DataFrame.fillna)
  - Smooth noisy data
  - Identify or remove outliers
  - Resolve inconsistencies
  - Encoding categorical features ✅ (e.g., sklearn.preprocessing.OneHotEncoder, OrdinalEncoder)
- Data integration
- Data reduction
  - Dimensionality reduction ✅ (e.g., sklearn.decomposition.PCA)
  - Numerosity reduction
    - Parametric methods
    - Non-parametric methods
      - Sampling
        - Simple random sampling with replacement
        - Simple random sampling without replacement
- Data transformation and data discretization
  - Normalization ✅ (e.g., sklearn.preprocessing.MinMaxScaler, StandardScaler, RobustScaler)
  - Concept hierarchy generation
  - Discretization with pandas (on the IRIS dataset) ✔️ (e.g., pd.cut and pd.qcut)
    - Equal width binning ✔️
    - Equal frequency binning ✔️
Encoding categorical features¶
Most sklearn estimators expect numerical variables in input. It is crucial to encode categorical variables into a numeric representation.
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
Ordinal encoder¶
Convert ordinal categorical features to integer codes. (Note that this implicitly assumes successive categories are equally spaced: the distance between consecutive codes is always 1.)
X = [['Low', 'Low'], ['Medium', 'High'], ['High', 'Low']]
pd.DataFrame(X, columns = ['X','Y']) # pretty print as a dataframe
| X | Y | |
|---|---|---|
| 0 | Low | Low |
| 1 | Medium | High |
| 2 | High | Low |
enc = OrdinalEncoder() # by default, lexicographic order
enc.fit(X)
X_encoded = enc.transform(X)
pd.DataFrame(X_encoded, columns = ['X','Y']) # pretty print as a dataframe
| X | Y | |
|---|---|---|
| 0 | 1.0 | 1.0 |
| 1 | 2.0 | 0.0 |
| 2 | 0.0 | 1.0 |
enc = OrdinalEncoder(categories = [['Low', 'Medium', 'High'], ['Low', 'Medium', 'High']]) # set custom order
enc.fit(X)
X_encoded = enc.transform(X)
pd.DataFrame(X_encoded, columns = ['X','Y']) # pretty print as a dataframe
| X | Y | |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 1.0 | 2.0 |
| 2 | 2.0 | 0.0 |
OneHot encoder¶
Encode nominal categorical features as a one-hot numeric array. (Nominal = 'categories, states, or "names of things"')
enc = OneHotEncoder()
# mainland, browser, smoker
X = [['US', 'Safari', 'yes'],
['Europe', 'Firefox', 'no'],
['Europe', 'Chrome', 'yes']]
pd.DataFrame(X, columns = ['mainland', 'browser', 'smoker']) # pretty print as a dataframe
| mainland | browser | smoker | |
|---|---|---|---|
| 0 | US | Safari | yes |
| 1 | Europe | Firefox | no |
| 2 | Europe | Chrome | yes |
enc.fit(X)
X_encoded = enc.transform(X).toarray()
X_encoded
array([[0., 1., 0., 0., 1., 0., 1.],
[1., 0., 0., 1., 0., 1., 0.],
[1., 0., 1., 0., 0., 0., 1.]])
enc.categories_
[array(['Europe', 'US'], dtype=object), array(['Chrome', 'Firefox', 'Safari'], dtype=object), array(['no', 'yes'], dtype=object)]
enc.get_feature_names_out(['mainland', 'browser', 'smoker'])
array(['mainland_Europe', 'mainland_US', 'browser_Chrome',
'browser_Firefox', 'browser_Safari', 'smoker_no', 'smoker_yes'],
dtype=object)
pd.DataFrame(X_encoded, columns = enc.get_feature_names_out(['mainland', 'browser', 'smoker'])) # pretty print as a dataframe
| mainland_Europe | mainland_US | browser_Chrome | browser_Firefox | browser_Safari | smoker_no | smoker_yes | |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
enc = OneHotEncoder(drop = 'first') # 'first' drop the first category in each feature
# mainland, browser, smoker
X = [['US', 'Safari', 'yes'],
['Europe', 'Firefox', 'no'],
['Europe', 'Chrome', 'yes']]
enc.fit(X)
X_encoded = enc.transform(X).toarray()
enc.get_feature_names_out(['mainland', 'browser', 'smoker'])
array(['mainland_US', 'browser_Firefox', 'browser_Safari', 'smoker_yes'],
dtype=object)
pd.DataFrame(X_encoded, columns = enc.get_feature_names_out(['mainland', 'browser', 'smoker'])) # pretty print as a dataframe
| mainland_US | browser_Firefox | browser_Safari | smoker_yes | |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 |
Notice that the same transformation can be obtained by using the pandas pd.get_dummies utility function.
X = [['US', 'Safari', 'yes'],
['Europe', 'Firefox', 'no'],
['Europe', 'Chrome', 'yes']]
df = pd.DataFrame(X,columns = ['mainland','browser','smoker'])
df
| mainland | browser | smoker | |
|---|---|---|---|
| 0 | US | Safari | yes |
| 1 | Europe | Firefox | no |
| 2 | Europe | Chrome | yes |
pd.get_dummies(df)
| mainland_Europe | mainland_US | browser_Chrome | browser_Firefox | browser_Safari | smoker_no | smoker_yes | |
|---|---|---|---|---|---|---|---|
| 0 | False | True | False | False | True | False | True |
| 1 | True | False | False | True | False | True | False |
| 2 | True | False | True | False | False | False | True |
Dimensionality reduction: PCA¶
Principal Component Analysis (PCA) finds a projection that captures the largest amount of variation in the data. It can be used effectively as a dimensionality reduction technique.
Consider the Iris dataset. It cannot be visualized in a single 2D plot, as it has 4 features. We are going to extract 2 linear combinations of the sepal and petal dimensions to visualize it.
from sklearn.decomposition import PCA
PCA?
Init signature: PCA( n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None, ) Docstring: Principal component analysis (PCA). Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD. It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract. With sparse inputs, the ARPACK implementation of the truncated SVD can be used (i.e. through :func:`scipy.sparse.linalg.svds`). Alternatively, one may consider :class:`TruncatedSVD` where the data are not centered. Notice that this class only supports sparse inputs for some solvers such as "arpack" and "covariance_eigh". See :class:`TruncatedSVD` for an alternative with sparse data. For a usage example, see :ref:`sphx_glr_auto_examples_decomposition_plot_pca_iris.py` Read more in the :ref:`User Guide <PCA>`. Parameters ---------- n_components : int, float or 'mle', default=None Number of components to keep. if n_components is not set all components are kept:: n_components == min(n_samples, n_features) If ``n_components == 'mle'`` and ``svd_solver == 'full'``, Minka's MLE is used to guess the dimension. Use of ``n_components == 'mle'`` will interpret ``svd_solver == 'auto'`` as ``svd_solver == 'full'``. If ``0 < n_components < 1`` and ``svd_solver == 'full'``, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If ``svd_solver == 'arpack'``, the number of components must be strictly less than the minimum of n_features and n_samples. 
Hence, the None case results in:: n_components == min(n_samples, n_features) - 1 copy : bool, default=True If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead. whiten : bool, default=False When True (False by default) the `components_` vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions. svd_solver : {'auto', 'full', 'covariance_eigh', 'arpack', 'randomized'}, default='auto' "auto" : The solver is selected by a default 'auto' policy is based on `X.shape` and `n_components`: if the input data has fewer than 1000 features and more than 10 times as many samples, then the "covariance_eigh" solver is used. Otherwise, if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient "randomized" method is selected. Otherwise the exact "full" SVD is computed and optionally truncated afterwards. "full" : Run exact full SVD calling the standard LAPACK solver via `scipy.linalg.svd` and select the components by postprocessing "covariance_eigh" : Precompute the covariance matrix (on centered data), run a classical eigenvalue decomposition on the covariance matrix typically using LAPACK and select the components by postprocessing. This solver is very efficient for n_samples >> n_features and small n_features. It is, however, not tractable otherwise for large n_features (large memory footprint required to materialize the covariance matrix). 
Also note that compared to the "full" solver, this solver effectively doubles the condition number and is therefore less numerical stable (e.g. on input data with a large range of singular values). "arpack" : Run SVD truncated to `n_components` calling ARPACK solver via `scipy.sparse.linalg.svds`. It requires strictly `0 < n_components < min(X.shape)` "randomized" : Run randomized SVD by the method of Halko et al. .. versionadded:: 0.18.0 .. versionchanged:: 1.5 Added the 'covariance_eigh' solver. tol : float, default=0.0 Tolerance for singular values computed by svd_solver == 'arpack'. Must be of range [0.0, infinity). .. versionadded:: 0.18.0 iterated_power : int or 'auto', default='auto' Number of iterations for the power method computed by svd_solver == 'randomized'. Must be of range [0, infinity). .. versionadded:: 0.18.0 n_oversamples : int, default=10 This parameter is only relevant when `svd_solver="randomized"`. It corresponds to the additional number of random vectors to sample the range of `X` so as to ensure proper conditioning. See :func:`~sklearn.utils.extmath.randomized_svd` for more details. .. versionadded:: 1.1 power_iteration_normalizer : {'auto', 'QR', 'LU', 'none'}, default='auto' Power iteration normalizer for randomized SVD solver. Not used by ARPACK. See :func:`~sklearn.utils.extmath.randomized_svd` for more details. .. versionadded:: 1.1 random_state : int, RandomState instance or None, default=None Used when the 'arpack' or 'randomized' solvers are used. Pass an int for reproducible results across multiple function calls. See :term:`Glossary <random_state>`. .. versionadded:: 0.18.0 Attributes ---------- components_ : ndarray of shape (n_components, n_features) Principal axes in feature space, representing the directions of maximum variance in the data. Equivalently, the right singular vectors of the centered input data, parallel to its eigenvectors. The components are sorted by decreasing ``explained_variance_``. 
explained_variance_ : ndarray of shape (n_components,) The amount of variance explained by each of the selected components. The variance estimation uses `n_samples - 1` degrees of freedom. Equal to n_components largest eigenvalues of the covariance matrix of X. .. versionadded:: 0.18 explained_variance_ratio_ : ndarray of shape (n_components,) Percentage of variance explained by each of the selected components. If ``n_components`` is not set then all components are stored and the sum of the ratios is equal to 1.0. singular_values_ : ndarray of shape (n_components,) The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the ``n_components`` variables in the lower-dimensional space. .. versionadded:: 0.19 mean_ : ndarray of shape (n_features,) Per-feature empirical mean, estimated from the training set. Equal to `X.mean(axis=0)`. n_components_ : int The estimated number of components. When n_components is set to 'mle' or a number between 0 and 1 (with svd_solver == 'full') this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None. n_samples_ : int Number of samples in the training data. noise_variance_ : float The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See "Pattern Recognition and Machine Learning" by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples. Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X. n_features_in_ : int Number of features seen during :term:`fit`. .. versionadded:: 0.24 feature_names_in_ : ndarray of shape (`n_features_in_`,) Names of features seen during :term:`fit`. Defined only when `X` has feature names that are all strings. .. 
versionadded:: 1.0 See Also -------- KernelPCA : Kernel Principal Component Analysis. SparsePCA : Sparse Principal Component Analysis. TruncatedSVD : Dimensionality reduction using truncated SVD. IncrementalPCA : Incremental Principal Component Analysis. References ---------- For n_components == 'mle', this class uses the method from: `Minka, T. P.. "Automatic choice of dimensionality for PCA". In NIPS, pp. 598-604 <https://tminka.github.io/papers/pca/minka-pca.pdf>`_ Implements the probabilistic PCA model from: `Tipping, M. E., and Bishop, C. M. (1999). "Probabilistic principal component analysis". Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611-622. <http://www.miketipping.com/papers/met-mppca.pdf>`_ via the score and score_samples methods. For svd_solver == 'arpack', refer to `scipy.sparse.linalg.svds`. For svd_solver == 'randomized', see: :doi:`Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions". SIAM review, 53(2), 217-288. <10.1137/090771806>` and also :doi:`Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011). "A randomized algorithm for the decomposition of matrices". Applied and Computational Harmonic Analysis, 30(1), 47-68. <10.1016/j.acha.2010.02.003>` Examples -------- >>> import numpy as np >>> from sklearn.decomposition import PCA >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) >>> pca = PCA(n_components=2) >>> pca.fit(X) PCA(n_components=2) >>> print(pca.explained_variance_ratio_) [0.9924... 0.0075...] >>> print(pca.singular_values_) [6.30061... 0.54980...] >>> pca = PCA(n_components=2, svd_solver='full') >>> pca.fit(X) PCA(n_components=2, svd_solver='full') >>> print(pca.explained_variance_ratio_) [0.9924... 0.00755...] >>> print(pca.singular_values_) [6.30061... 0.54980...] 
>>> pca = PCA(n_components=1, svd_solver='arpack') >>> pca.fit(X) PCA(n_components=1, svd_solver='arpack') >>> print(pca.explained_variance_ratio_) [0.99244...] >>> print(pca.singular_values_) [6.30061...] File: ~/opt/anaconda3/envs/DMML24/lib/python3.11/site-packages/sklearn/decomposition/_pca.py Type: ABCMeta Subclasses:
X, y = iris.data, iris.target
In general, scaling (more precisely, standardizing: rescaling the features so that they have the properties of a standard normal distribution, with zero mean and unit standard deviation) is important, because PCA in sklearn centers the data but does not scale it.
PCA is a variance maximization exercise:
- Suppose you describe a group of people in terms of height (in meters) and weight (in kilos).
- Height varies less than weight because of their respective scales.
- If those features are not scaled, PCA might determine that the direction of maximal variance corresponds more closely with the 'weight' axis. This is misleading, as a change in height of one meter is arguably far more significant than a change in weight of one kilogram.
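This height/weight argument can be illustrated on synthetic data (the numbers below are made up for the illustration, assuming height and weight are correlated):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
z = rng.normal(size=500)
e = rng.normal(size=500)
height = 1.75 + 0.10 * z                      # metres: small spread
weight = 75.0 + 12.0 * (0.7 * z + 0.714 * e)  # kilograms: large spread, correlated with height
X_people = np.column_stack([height, weight])

# without scaling, the first component is dominated by the weight axis
pc1_raw = PCA(n_components=1).fit(X_people).components_[0]
print(np.abs(pc1_raw))  # weight loading close to 1, height loading close to 0

# after standardizing, both features contribute comparably
X_std = StandardScaler().fit_transform(X_people)
pc1_std = PCA(n_components=1).fit(X_std).components_[0]
print(np.abs(pc1_std))  # roughly [0.71, 0.71]
```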
In this case, however, we can directly apply PCA to the original dataset as the range of variables is comparable.
pca = PCA(n_components = 0.95)
pca.fit(X)
X_reduced = pca.transform(X)
print("Reduced dataset shape:", X_reduced.shape)
Reduced dataset shape: (150, 2)
pca.components_
array([[ 0.36138659, -0.08452251, 0.85667061, 0.3582892 ],
[ 0.65658877, 0.73016143, -0.17337266, -0.07548102]])
pca.explained_variance_
array([4.22824171, 0.24267075])
pca.explained_variance_ratio_
array([0.92461872, 0.05306648])
plt.scatter(X_reduced[:, 0],
X_reduced[:, 1],
c = y,
cmap = 'Paired')
plt.xlabel('Component #1')
plt.ylabel('Component #2')
plt.show()
import plotly.express as px
df = px.data.iris()
pca = PCA(n_components = 3)
components = pca.fit_transform(X) # notice that we are coupling fit and transform in a single statement
total_var = pca.explained_variance_ratio_.sum() * 100
fig = px.scatter_3d(
components, x = 0, y = 1, z = 2, color = y,
title = f'Total Explained Variance: {total_var:.2f}%',
labels = {'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'},
height = 600
)
fig.show()
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)
px.area(
x = range(1, exp_var_cumul.shape[0] + 1),
y = exp_var_cumul,
labels = {"x": "# Components", "y": "Explained Variance"}
)
In the Iris dataset, the first principal component explains more than 90% of the variance of the dataset.
from pandas.plotting import scatter_matrix
scatter_matrix(pd.DataFrame(X), alpha = 1, figsize = (10, 10))
plt.show()
pca = PCA(n_components = 4)
transformed_X = pca.fit_transform(X)
scatter_matrix(pd.DataFrame(transformed_X),
alpha = 1,
figsize = (10, 10))
plt.show()
The scatter plots change after the PCA transformation, suggesting how PCA finds uncorrelated components.
Some plotting utilities from the source above.
from matplotlib import cm
from matplotlib import colors, colorbar
# plasma does not exist in matplotlib < 1.5
cmap = getattr(cm, "plasma_r", cm.hot_r)
def create_axes(title, figsize=(16, 6)):
    fig = plt.figure(figsize=figsize)
    fig.suptitle(title)
    # define the axis for the first plot
    left, width = 0.1, 0.22
    bottom, height = 0.1, 0.7
    bottom_h = height + 0.15
    left_h = left + width + 0.02
    rect_scatter = [left, bottom, width, height]
    rect_histx = [left, bottom_h, width, 0.1]
    rect_histy = [left_h, bottom, 0.05, height]
    ax_scatter = plt.axes(rect_scatter)
    ax_histx = plt.axes(rect_histx)
    ax_histy = plt.axes(rect_histy)
    # define the axis for the zoomed-in plot
    left = width + left + 0.2
    left_h = left + width + 0.02
    rect_scatter = [left, bottom, width, height]
    rect_histx = [left, bottom_h, width, 0.1]
    rect_histy = [left_h, bottom, 0.05, height]
    ax_scatter_zoom = plt.axes(rect_scatter)
    ax_histx_zoom = plt.axes(rect_histx)
    ax_histy_zoom = plt.axes(rect_histy)
    # define the axis for the colorbar
    left, width = width + left + 0.13, 0.01
    rect_colorbar = [left, bottom, width, height]
    ax_colorbar = plt.axes(rect_colorbar)
    return (
        (ax_scatter, ax_histy, ax_histx),
        (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),
        ax_colorbar,
    )
def plot_distribution(axes, X, y, hist_nbins=50, title="", x0_label="", x1_label=""):
    ax, hist_X1, hist_X0 = axes
    ax.set_title(title)
    ax.set_xlabel(x0_label)
    ax.set_ylabel(x1_label)
    # The scatter plot
    colors = cmap(y)
    ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker="o", s=5, lw=0, c=colors)
    # Removing the top and the right spine for aesthetics
    # make nice axis layout
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines["left"].set_position(("outward", 10))
    ax.spines["bottom"].set_position(("outward", 10))
    # Histogram for axis X1 (feature 5)
    hist_X1.set_ylim(ax.get_ylim())
    hist_X1.hist(
        X[:, 1], bins=hist_nbins, orientation="horizontal", color="grey", ec="grey"
    )
    hist_X1.axis("off")
    # Histogram for axis X0 (feature 0)
    hist_X0.set_xlim(ax.get_xlim())
    hist_X0.hist(
        X[:, 0], bins=hist_nbins, orientation="vertical", color="grey", ec="grey"
    )
    hist_X0.axis("off")
def make_plot(title, data):
    ax_zoom_out, ax_zoom_in, ax_colorbar = create_axes(title)
    axarr = (ax_zoom_out, ax_zoom_in)
    # full data
    plot_distribution(
        axarr[0],
        data,
        y,
        hist_nbins=200,
        x0_label=feature_mapping[features[0]],
        x1_label=feature_mapping[features[1]],
        title="Full data",
    )
    # zoom-in
    zoom_in_percentile_range = (0, 99)
    cutoffs_X0 = np.percentile(data[:, 0], zoom_in_percentile_range)
    cutoffs_X1 = np.percentile(data[:, 1], zoom_in_percentile_range)
    non_outliers_mask = np.all(data > [cutoffs_X0[0], cutoffs_X1[0]], axis=1) & np.all(
        data < [cutoffs_X0[1], cutoffs_X1[1]], axis=1
    )
    plot_distribution(
        axarr[1],
        data[non_outliers_mask],
        y[non_outliers_mask],
        hist_nbins=50,
        x0_label=feature_mapping[features[0]],
        x1_label=feature_mapping[features[1]],
        title="Zoom-in",
    )
    # colorbar
    norm = colors.Normalize(y_full.min(), y_full.max())
    colorbar.ColorbarBase(
        ax_colorbar,
        cmap=cmap,
        norm=norm,
        orientation="vertical",
        label="Color mapping for values of y",
    )
Fetch the California housing dataset using the sklearn utility.
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()
feature_names = dataset.feature_names
feature_mapping = {
"MedInc": "Median income in block group", # <------- !
"HousAge": "Median house age in block group",
"AveRooms": "Average number of rooms per household",
"AveBedrms": "Average number of bedrooms per household",
"Population": "Block group population",
"AveOccup": "Average house occupancy (number of household members)", # <------- !
"Latitude": "Block group latitude",
"Longitude": "Block group longitude",
}
from sklearn.preprocessing import minmax_scale
X_full, y_full = dataset.data, dataset.target
y = minmax_scale(y_full) # scale the output between 0 and 1 for the colorbar
X_full.shape
(20640, 8)
A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).
df = pd.DataFrame(np.concatenate((dataset.data,
np.expand_dims(dataset.target, -1)),
axis = 1),
columns = dataset.feature_names + ['MedianHouseVal'])
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MedInc | 20640.0 | 3.870671 | 1.899822 | 0.499900 | 2.563400 | 3.534800 | 4.743250 | 15.000100 |
| HouseAge | 20640.0 | 28.639486 | 12.585558 | 1.000000 | 18.000000 | 29.000000 | 37.000000 | 52.000000 |
| AveRooms | 20640.0 | 5.429000 | 2.474173 | 0.846154 | 4.440716 | 5.229129 | 6.052381 | 141.909091 |
| AveBedrms | 20640.0 | 1.096675 | 0.473911 | 0.333333 | 1.006079 | 1.048780 | 1.099526 | 34.066667 |
| Population | 20640.0 | 1425.476744 | 1132.462122 | 3.000000 | 787.000000 | 1166.000000 | 1725.000000 | 35682.000000 |
| AveOccup | 20640.0 | 3.070655 | 10.386050 | 0.692308 | 2.429741 | 2.818116 | 3.282261 | 1243.333333 |
| Latitude | 20640.0 | 35.631861 | 2.135952 | 32.540000 | 33.930000 | 34.260000 | 37.710000 | 41.950000 |
| Longitude | 20640.0 | -119.569704 | 2.003532 | -124.350000 | -121.800000 | -118.490000 | -118.010000 | -114.310000 |
| MedianHouseVal | 20640.0 | 2.068558 | 1.153956 | 0.149990 | 1.196000 | 1.797000 | 2.647250 | 5.000010 |
df[['MedInc','AveOccup']].plot(kind = 'box',
subplots = True,
figsize = (8, 5),
layout = (1, 2),
sharey = False)
plt.show()
Notice that the maximum value of average occupancy (AveOccup) lies far beyond the third quartile.
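The whisker rule behind that box plot can be sketched numerically. The snippet below uses synthetic data as a stand-in for AveOccup (the variable names and the injected outlier values are illustrative) and computes the standard Q3 + 1.5·IQR upper-whisker cut-off that matplotlib uses to flag fliers:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in: a mostly compact feature plus a few huge outliers
x = np.concatenate([rng.normal(3.0, 0.5, 1000), [50.0, 1200.0]])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
upper_whisker = q3 + 1.5 * iqr  # points above this are drawn as fliers
print(upper_whisker, x.max())   # the maximum dwarfs the whisker cut-off
```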
features = ["MedInc", "AveOccup"]
features_idx = [feature_names.index(feature) for feature in features]
X = X_full[:, features_idx]
We will analyze the behaviour of the following scalers:
- StandardScaler
- MinMaxScaler
- RobustScaler
Notice that all of them operate on the various features independently.
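To see what "operating on the features independently" means in practice, here is a minimal sketch (the two-column array `X_demo` is made up): each column is fitted with its own statistics, and the columns never interact:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two columns on very different scales
X_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaler = StandardScaler().fit(X_demo)
# each column gets its own mean and scale; columns never mix
print(scaler.mean_)                                    # per-column means
print(np.allclose(scaler.mean_, X_demo.mean(axis=0)))  # True
```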
Plot structure:
- left plot: entire dataset
- right plot: zoomed-in plot to show the dataset without the marginal outliers (keeping data within 99-percentile)
make_plot("Unscaled data", X)
A large majority of the samples are concentrated in a narrow range: [0, 10] for the median income and [0, 6] for the average house occupancy. Note that there are some marginal outliers (some block groups have an average occupancy of more than 1,200).
Standard Scaler¶
- The motivation for this scaling is that many machine learning estimators may behave badly if the individual features do not look more or less like standard normally distributed data: Gaussian with zero mean and unit variance. In practice we often ignore the shape of the distribution and simply center the data by removing the mean of each feature, then scale it by dividing non-constant features by their standard deviation.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
make_plot("Data after standard scaling", X_scaled)
scaler.mean_, scaler.scale_
(array([3.870671 , 3.07065516]), array([ 1.89977569, 10.38579796]))
X_scaled.mean(axis = 0), X_scaled.std(axis = 0)
(array([6.60969987e-17, 3.44255201e-18]), array([1., 1.]))
The StandardScaler removes the mean and scales the data to unit variance. The scaling shrinks the range of the feature values as shown in the left figure below.
Due to the magnitude of the outliers, however, most of the data lie in the [-2, 4] range for the transformed median income, while the same data are squeezed into the much narrower [-0.2, 0.2] range for the transformed average house occupancy. StandardScaler therefore cannot guarantee balanced feature scales in the presence of outliers.
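The standardization formula z = (x - mean) / std can be checked by hand against StandardScaler. A small sketch on synthetic skewed data (`X_demo` is illustrative, not the housing features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.lognormal(mean=1.0, sigma=0.6, size=(500, 2))  # skewed, income-like

# manual z-score: subtract the column mean, divide by the column std
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
auto = StandardScaler().fit_transform(X_demo)
print(np.allclose(manual, auto))  # True
```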
MinMax Scaler¶
- The motivation for this scaling includes robustness to very small standard deviations of features and the preservation of zero entries in sparse data.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
make_plot("Data after minmax scaling", X_scaled)
X_scaled.min(axis = 0), X_scaled.max(axis = 0)
(array([0., 0.]), array([1., 1.]))
The MinMaxScaler rescales the data set such that all feature values are in the range [0, 1].
However, this scaling compresses all inliers into the narrow range [0, 0.005] for the transformed average house occupancy.
Both StandardScaler and MinMaxScaler are very sensitive to the presence of outliers.
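The min-max formula (x - min) / (max - min) makes this sensitivity easy to reproduce: a single extreme value dominates the denominator. A small sketch with made-up data, mimicking one outlier of 1243 like the one in AveOccup:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# 99 inliers in [1, 6] plus one huge outlier
x = np.concatenate([np.linspace(1, 6, 99), [1243.0]])[:, None]
x_scaled = MinMaxScaler().fit_transform(x)

# equivalent manual formula
manual = (x - x.min()) / (x.max() - x.min())
print(np.allclose(x_scaled, manual))  # True
print(x_scaled[:-1].max())            # all inliers squeezed below ~0.005
```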
Robust Scaler¶
- Scaling using the mean and variance of the data is unlikely to work well when the data contain many outliers (the same holds for min-max scaling). In these cases, RobustScaler can be used instead, as it relies on more robust estimates of the center and range of the data.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler(quantile_range = (25.0, 75.0))
X_scaled = scaler.fit_transform(X)
make_plot("Data after robust scaling", X_scaled)
scaler.center_, scaler.scale_
(array([3.5348 , 2.81811565]), array([2.17985 , 0.85251978]))
The RobustScaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
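On synthetic data with a couple of injected outliers (the values below are illustrative), the following sketch verifies that `center_` and `scale_` are simply the median and the IQR computed with np.percentile:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(3.0, 1.0, 500), [500.0, 1200.0]])[:, None]

scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(x)
q1, q3 = np.percentile(x, [25, 75])
# center_ is the median, scale_ is the interquartile range
print(np.allclose(scaler.center_, np.median(x)))  # True
print(np.allclose(scaler.scale_, q3 - q1))        # True
```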
The centering and scaling statistics of RobustScaler are based on percentiles and are therefore not influenced by a small number of very large marginal outliers.
Consequently, the resulting ranges of the transformed feature values are larger than for the previous scalers and, more importantly, approximately similar: for both features, most of the transformed values lie in a [-2, 3] range, as seen in the zoomed-in figure.
Note that the outliers themselves are still present in the transformed data. If clipping the outliers into a bounded range is desirable, a non-linear transformation is required (e.g., using QuantileTransformer).
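As a sketch of that last remark (with illustrative parameters on synthetic data), QuantileTransformer maps every value, outliers included, into its bounded output range:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(3.0, 1.0, 1000), [500.0, 1200.0]])[:, None]

qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
x_t = qt.fit_transform(x)
# every value, including the extreme outliers, now lies inside [0, 1]
print(x_t.min(), x_t.max())
```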